CIA Country Analysis and Clustering

Source: These data sets are compiled from US government data. https://www.cia.gov/library/publications/the-world-factbook/docs/faqs.html

Goal:

Gain insights into the similarity between countries and regions of the world by experimenting with different numbers of clusters. What do these clusters represent? Note: there is no 100% right answer; make sure to watch the video for thoughts.


Imports and Data

TASK: Run the following cells to import libraries and read in data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
df = pd.read_csv('inp_files/CIA_Country_Facts.csv')

Exploratory Data Analysis

TASK: Explore the rows and columns of the data as well as the data types of the columns.

In [3]:
df.head()
Out[3]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
0 Afghanistan ASIA (EX. NEAR EAST) 31056997 647500 48.0 0.00 23.06 163.07 700.0 36.0 3.2 12.13 0.22 87.65 1.0 46.60 20.34 0.380 0.240 0.380
1 Albania EASTERN EUROPE 3581655 28748 124.6 1.26 -4.93 21.52 4500.0 86.5 71.2 21.09 4.42 74.49 3.0 15.11 5.22 0.232 0.188 0.579
2 Algeria NORTHERN AFRICA 32930091 2381740 13.8 0.04 -0.39 31.00 6000.0 70.0 78.1 3.22 0.25 96.53 1.0 17.14 4.61 0.101 0.600 0.298
3 American Samoa OCEANIA 57794 199 290.4 58.29 -20.71 9.27 8000.0 97.0 259.5 10.00 15.00 75.00 2.0 22.46 3.27 NaN NaN NaN
4 Andorra WESTERN EUROPE 71201 468 152.1 0.00 6.60 4.05 19000.0 100.0 497.2 2.22 0.00 97.78 3.0 8.71 6.25 NaN NaN NaN
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 227 entries, 0 to 226
Data columns (total 20 columns):
 #   Column                              Non-Null Count  Dtype  
---  ------                              --------------  -----  
 0   Country                             227 non-null    object 
 1   Region                              227 non-null    object 
 2   Population                          227 non-null    int64  
 3   Area (sq. mi.)                      227 non-null    int64  
 4   Pop. Density (per sq. mi.)          227 non-null    float64
 5   Coastline (coast/area ratio)        227 non-null    float64
 6   Net migration                       224 non-null    float64
 7   Infant mortality (per 1000 births)  224 non-null    float64
 8   GDP ($ per capita)                  226 non-null    float64
 9   Literacy (%)                        209 non-null    float64
 10  Phones (per 1000)                   223 non-null    float64
 11  Arable (%)                          225 non-null    float64
 12  Crops (%)                           225 non-null    float64
 13  Other (%)                           225 non-null    float64
 14  Climate                             205 non-null    float64
 15  Birthrate                           224 non-null    float64
 16  Deathrate                           223 non-null    float64
 17  Agriculture                         212 non-null    float64
 18  Industry                            211 non-null    float64
 19  Service                             212 non-null    float64
dtypes: float64(16), int64(2), object(2)
memory usage: 35.6+ KB
In [7]:
df.describe().transpose()
Out[7]:
count mean std min 25% 50% 75% max
Population 227.0 2.874028e+07 1.178913e+08 7026.000 437624.00000 4786994.000 1.749777e+07 1.313974e+09
Area (sq. mi.) 227.0 5.982270e+05 1.790282e+06 2.000 4647.50000 86600.000 4.418110e+05 1.707520e+07
Pop. Density (per sq. mi.) 227.0 3.790471e+02 1.660186e+03 0.000 29.15000 78.800 1.901500e+02 1.627150e+04
Coastline (coast/area ratio) 227.0 2.116533e+01 7.228686e+01 0.000 0.10000 0.730 1.034500e+01 8.706600e+02
Net migration 224.0 3.812500e-02 4.889269e+00 -20.990 -0.92750 0.000 9.975000e-01 2.306000e+01
Infant mortality (per 1000 births) 224.0 3.550696e+01 3.538990e+01 2.290 8.15000 21.000 5.570500e+01 1.911900e+02
GDP ($ per capita) 226.0 9.689823e+03 1.004914e+04 500.000 1900.00000 5550.000 1.570000e+04 5.510000e+04
Literacy (%) 209.0 8.283828e+01 1.972217e+01 17.600 70.60000 92.500 9.800000e+01 1.000000e+02
Phones (per 1000) 223.0 2.360614e+02 2.279918e+02 0.200 37.80000 176.200 3.896500e+02 1.035600e+03
Arable (%) 225.0 1.379711e+01 1.304040e+01 0.000 3.22000 10.420 2.000000e+01 6.211000e+01
Crops (%) 225.0 4.564222e+00 8.361470e+00 0.000 0.19000 1.030 4.440000e+00 5.068000e+01
Other (%) 225.0 8.163831e+01 1.614083e+01 33.330 71.65000 85.700 9.544000e+01 1.000000e+02
Climate 205.0 2.139024e+00 6.993968e-01 1.000 2.00000 2.000 3.000000e+00 4.000000e+00
Birthrate 224.0 2.211473e+01 1.117672e+01 7.290 12.67250 18.790 2.982000e+01 5.073000e+01
Deathrate 223.0 9.241345e+00 4.990026e+00 2.290 5.91000 7.840 1.060500e+01 2.974000e+01
Agriculture 212.0 1.508443e-01 1.467980e-01 0.000 0.03775 0.099 2.210000e-01 7.690000e-01
Industry 211.0 2.827109e-01 1.382722e-01 0.020 0.19300 0.272 3.410000e-01 9.060000e-01
Service 212.0 5.652830e-01 1.658410e-01 0.062 0.42925 0.571 6.785000e-01 9.540000e-01

Exploratory Data Analysis

Let's create some visualizations. Please feel free to expand on these with your own analysis and charts!

TASK: Create a histogram of the Population column.

In [8]:
sns.histplot(data=df,x='Population')
plt.show() # China and India - two tiny dots on the far right force the whole plot to zoom out

TASK: You should notice the histogram is skewed due to a few large countries; reset the X axis to only show countries with fewer than 0.5 billion people.

In [10]:
# half a billion = 500,000,000, i.e. 5 * 10^8
sns.histplot(data=df[df['Population']<500000000],x='Population')
plt.show()
# the plot now ranges from 0 to 3 * 10^8, which is roughly a third of a billion

TASK: Now let's explore GDP and Regions. Create a bar chart showing the mean GDP per Capita per region (recall the black bar represents std).

In [11]:
sns.barplot(data=df,x='Region',y='GDP ($ per capita)')
plt.xticks(rotation=90)
plt.show()
# we can see regional wealth: Western Europe and Northern America are the richest/most productive regions.
# the black line in the NORTHERN AMERICA region shows the standard deviation; it is long (large) because
# US GDP per capita is much higher than Canada's or Mexico's. It really shows that the spread of wealth
# within that region is large.

TASK: Create a scatterplot showing the relationship between Phones per 1000 people and the GDP per Capita. Color these points by Region.

In [14]:
plt.figure(figsize=(10,6),dpi=200)
sns.scatterplot(data=df,x='GDP ($ per capita)',y='Phones (per 1000)', hue='Region')
plt.legend(loc=(1.05,0.5)) # specify the (x,y) position where the legend is placed
plt.show()
# we can see 2 green outliers. One at the very top is above 1000, even though the column is per 1000 people,
# meaning that country has more phones than people. The other, on the right, has a very high GDP per capita
# but relatively few phones compared to other countries. Normally, as GDP per capita grows, you would
# expect the number of phones in a country to grow roughly in proportion.
# Next, let's look at which countries these are.
In [15]:
df[df['GDP ($ per capita)']>50000] # a very rich country with very few inhabitants - hence the outlier in this case
Out[15]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
121 Luxembourg WESTERN EUROPE 474413 2586 183.5 0.0 8.97 4.81 55100.0 100.0 515.4 23.28 0.4 76.32 NaN 11.94 8.41 0.01 0.13 0.86
In [16]:
df[df['Phones (per 1000)']>1000] # 1035.6 phones per 1000 people
Out[16]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service
138 Monaco WESTERN EUROPE 32543 2 16271.5 205.0 7.75 5.43 27000.0 99.0 1035.6 0.0 0.0 100.0 NaN 9.19 12.91 0.17 NaN NaN

TASK: Create a scatterplot showing the relationship between GDP per Capita and Literacy (color the points by Region). What conclusions do you draw from this plot?

In [17]:
plt.figure(figsize=(10,6),dpi=200)
sns.scatterplot(data=df,x='GDP ($ per capita)',y='Literacy (%)', hue='Region')
plt.show()
# there is no linear relationship like you might expect.
# it really shows that low literacy makes it very likely the country is poor, but a poor country
# does not necessarily have low literacy. Also, above roughly $10,000 GDP per capita it is practically
# guaranteed that the country is highly literate.

TASK: Create a Heatmap of the Correlation between columns in the DataFrame.

In [18]:
sns.heatmap(data=df.corr())
plt.show()
# infant mortality and birthrate carry essentially the same information, so there should be a strong correlation.
# We can see that is the case: the cell color is very close to white, and white means corr=1.

TASK: Seaborn can automatically perform hierarchical clustering through the clustermap() function. Create a clustermap of the correlations between each column with this function.

In [20]:
sns.clustermap(data=df.corr())
plt.show()
# we can immediately inspect the hierarchical clustering results and see which features are similar.
# we can see two big clusters (the light squares, also visible from the dendrogram)
# and that they are opposites, i.e. anti-correlated with each other (the dark squares).
# the first one consists of 'life' features: deathrate, birthrate, infant mortality, agriculture;
# the other of country wealth measures: phones, GDP, literacy, service;
# and then there are many smaller clusters (see the dendrogram) made up of various features.

Data Preparation and Model Discovery

Let's now prepare our data for Kmeans Clustering!

Missing Data

TASK: Report the number of missing elements per column.

In [21]:
df.isnull().sum()
Out[21]:
Country                                0
Region                                 0
Population                             0
Area (sq. mi.)                         0
Pop. Density (per sq. mi.)             0
Coastline (coast/area ratio)           0
Net migration                          3
Infant mortality (per 1000 births)     3
GDP ($ per capita)                     1
Literacy (%)                          18
Phones (per 1000)                      4
Arable (%)                             2
Crops (%)                              2
Other (%)                              2
Climate                               22
Birthrate                              3
Deathrate                              4
Agriculture                           15
Industry                              16
Service                               15
dtype: int64

TASK: What countries have NaN for Agriculture? What is the main aspect of these countries?

In [22]:
df[df['Agriculture'].isnull()]['Country']
Out[22]:
3            American Samoa
4                   Andorra
78                Gibraltar
80                Greenland
83                     Guam
134                 Mayotte
140              Montserrat
144                   Nauru
153      N. Mariana Islands
171            Saint Helena
174    St Pierre & Miquelon
177              San Marino
208       Turks & Caicos Is
221       Wallis and Futuna
223          Western Sahara
Name: Country, dtype: object

TASK: You should have noticed most of these countries are tiny islands, with the exception of Greenland and Western Sahara. Go ahead and fill these countries' missing NaN values with 0, since they are so small or essentially non-existent. There should be 15 countries in total you do this for. For a hint on how to do this, recall you can do the following:

df[df['feature'].isnull()]
In [3]:
# zero out all the NaN values, not only in the Agriculture column, because we saw that
# where Agriculture is NaN it is tiny islands, which will lack many of the other metrics too (educated guess)
df[df['Agriculture'].isnull()] = df[df['Agriculture'].isnull()].fillna(0)
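As an aside, the boolean-mask assignment above can also be written with an explicit `.loc`, which is the form pandas recommends for conditional writes. A minimal sketch on a toy frame (column values invented for illustration):

```python
import numpy as np
import pandas as pd

# toy stand-in for df (invented values)
toy = pd.DataFrame({
    'Agriculture': [0.2, np.nan, 0.4],
    'Industry':    [0.3, np.nan, np.nan],
})

# rows where Agriculture is NaN get ALL of their NaNs replaced with 0
mask = toy['Agriculture'].isnull()
toy.loc[mask] = toy.loc[mask].fillna(0)

print(toy)
```

Note that row 2 keeps its NaN in Industry: only rows selected by the mask are touched.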

TASK: Now check to see what is still missing by counting number of missing elements again per feature:

In [41]:
df.isnull().sum()
# we can see that quite a few missing values in the other columns were cleared as well, which is good.
# this was a quick fix for the small-islands problem
# of course you could dig deeper and solve it more carefully (not all of them are small islands)
Out[41]:
Country                                0
Region                                 0
Population                             0
Area (sq. mi.)                         0
Pop. Density (per sq. mi.)             0
Coastline (coast/area ratio)           0
Net migration                          1
Infant mortality (per 1000 births)     1
GDP ($ per capita)                     0
Literacy (%)                          13
Phones (per 1000)                      2
Arable (%)                             1
Crops (%)                              1
Other (%)                              1
Climate                               18
Birthrate                              1
Deathrate                              2
Agriculture                            0
Industry                               1
Service                                1
dtype: int64

TASK: Notice climate is missing for a few countries, but not the Region! Let's use this to our advantage. Fill in the missing Climate values based on the mean climate value for its region.

Hints on how to do this: https://stackoverflow.com/questions/19966018/pandas-filling-missing-values-by-mean-in-each-group

In [62]:
df.groupby('Region')['Climate'].mean()
Out[62]:
Region
ASIA (EX. NEAR EAST)                   1.962963
BALTICS                                3.000000
C.W. OF IND. STATES                    2.550000
EASTERN EUROPE                         3.111111
LATIN AMER. & CARIB                    2.033333
NEAR EAST                              1.666667
NORTHERN AFRICA                        1.500000
NORTHERN AMERICA                       1.500000
OCEANIA                                2.000000
SUB-SAHARAN AFRICA                     1.846939
WESTERN EUROPE                         2.826087
Name: Climate, dtype: float64
In [70]:
# every value of the Climate column, rewritten with the region's mean climate
df.groupby('Region')['Climate'].transform('mean')
# the same thing with a lambda function:
# df.groupby('Region')['Climate'].transform(lambda val: val.mean())
# val is the Climate column (one region group at a time)
Out[70]:
0      1.962963
1      3.111111
2      1.500000
3      2.000000
4      2.826087
         ...   
222    1.666667
223    1.500000
224    1.666667
225    1.846939
226    1.846939
Name: Climate, Length: 227, dtype: float64
In [71]:
# all values of the Climate column, but only those that were NaN/null are rewritten with the region's mean climate
df.groupby('Region')['Climate'].transform(lambda val: val.fillna(val.mean()))
Out[71]:
0      1.0
1      3.0
2      1.0
3      2.0
4      3.0
      ... 
222    3.0
223    1.0
224    1.0
225    2.0
226    2.0
Name: Climate, Length: 227, dtype: float64
In [4]:
# df[df['Climate'].isnull()]['Climate'] = new values - won't work, because filtering gives us a slice, and
# assigning to a column of that slice does not change the original df. It only works if we go through .loc:
# df.loc[df['Climate'].isnull(), 'Climate'] = df.groupby ...

# Alternatively we can assign directly: df['Climate'] = df['Climate'].fillna(values)
# because fillna replaces the NaN values but returns ALL values of the column,
# and the values we pass in must cover every row, not just the rows being replaced (i.e. the NaNs),
# since the fillna function itself takes care of that alignment.

df['Climate'] = df['Climate'].fillna(df.groupby('Region')['Climate'].transform('mean'))
# or as we did earlier:
# df['Climate'] = df.groupby('Region')['Climate'].transform(lambda val: val.fillna(val.mean()))

# So we can either assign the column its values with the NaNs already replaced, as earlier,
# or, as we did here, compute a value for every row and hand them to fillna, which replaces only the NaNs.
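The fillna/transform mechanics can be sanity-checked on a tiny invented example:

```python
import numpy as np
import pandas as pd

# toy stand-in for the Region/Climate situation (invented values)
toy = pd.DataFrame({
    'Region':  ['A', 'A', 'B', 'B'],
    'Climate': [1.0, np.nan, 3.0, 3.0],
})

# per-group means, broadcast back to the original row index
means = toy.groupby('Region')['Climate'].transform('mean')
# fillna only touches the NaN rows, so existing values are untouched
toy['Climate'] = toy['Climate'].fillna(means)

print(toy['Climate'].tolist())  # [1.0, 1.0, 3.0, 3.0]
```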

TASK: Check again how many elements are missing:

In [73]:
df.isnull().sum()
Out[73]:
Country                                0
Region                                 0
Population                             0
Area (sq. mi.)                         0
Pop. Density (per sq. mi.)             0
Coastline (coast/area ratio)           0
Net migration                          1
Infant mortality (per 1000 births)     1
GDP ($ per capita)                     0
Literacy (%)                          13
Phones (per 1000)                      2
Arable (%)                             1
Crops (%)                              1
Other (%)                              1
Climate                                0
Birthrate                              1
Deathrate                              2
Agriculture                            0
Industry                               1
Service                                1
dtype: int64

TASK: It looks like Literacy percentage is missing. Use the same tactic as we did with Climate missing values and fill in any missing Literacy % values with the mean Literacy % of the Region.

In [5]:
df['Literacy (%)'] = df['Literacy (%)'].fillna(df.groupby('Region')['Literacy (%)'].transform('mean'))

TASK: Check again on the remaining missing values:

In [75]:
df.isnull().sum()
Out[75]:
Country                               0
Region                                0
Population                            0
Area (sq. mi.)                        0
Pop. Density (per sq. mi.)            0
Coastline (coast/area ratio)          0
Net migration                         1
Infant mortality (per 1000 births)    1
GDP ($ per capita)                    0
Literacy (%)                          0
Phones (per 1000)                     2
Arable (%)                            1
Crops (%)                             1
Other (%)                             1
Climate                               0
Birthrate                             1
Deathrate                             2
Agriculture                           0
Industry                              1
Service                               1
dtype: int64

TASK: Optional: We are now missing values for only a few countries. Go ahead and drop these countries OR feel free to fill in these last few remaining values with any preferred methodology. For simplicity, we will drop these.

In [6]:
len(df)
Out[6]:
227
In [6]:
df = df.dropna()
len(df) # we dropped 6 countries. Not a big deal for clustering.
Out[6]:
221
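If you preferred filling over dropping, one possible methodology (a sketch, not the approach used here) is a per-region median fill for the leftover gaps; shown on invented data:

```python
import numpy as np
import pandas as pd

# toy stand-in for the few remaining NaNs (invented values)
toy = pd.DataFrame({
    'Region':    ['A', 'A', 'A', 'B'],
    'Birthrate': [10.0, 20.0, np.nan, 8.0],
})

# fill each numeric column's NaNs with the median of its region group
num_cols = toy.select_dtypes('number').columns
toy[num_cols] = toy.groupby('Region')[num_cols].transform(
    lambda col: col.fillna(col.median()))

print(toy['Birthrate'].tolist())  # [10.0, 20.0, 15.0, 8.0]
```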

Data Feature Preparation

TASK: It is now time to prepare the data for clustering. The Country column is still a unique identifier string, so it won't be useful for clustering, since it's unique for each point. Go ahead and drop the Country column.

In [7]:
# the country name is not a feature, it's just a label
X = df.drop('Country',axis=1)

TASK: Now let's create the X array of features. The Region column is still categorical strings; use Pandas to create dummy variables from this column to create a finalized X matrix of continuous features along with the dummy variables for the Regions.

In [8]:
X = pd.get_dummies(X)
X.head()
Out[8]:
Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) Phones (per 1000) Arable (%) ... Region_BALTICS Region_C.W. OF IND. STATES Region_EASTERN EUROPE Region_LATIN AMER. & CARIB Region_NEAR EAST Region_NORTHERN AFRICA Region_NORTHERN AMERICA Region_OCEANIA Region_SUB-SAHARAN AFRICA Region_WESTERN EUROPE
0 31056997 647500 48.0 0.00 23.06 163.07 700.0 36.0 3.2 12.13 ... 0 0 0 0 0 0 0 0 0 0
1 3581655 28748 124.6 1.26 -4.93 21.52 4500.0 86.5 71.2 21.09 ... 0 0 1 0 0 0 0 0 0 0
2 32930091 2381740 13.8 0.04 -0.39 31.00 6000.0 70.0 78.1 3.22 ... 0 0 0 0 0 1 0 0 0 0
3 57794 199 290.4 58.29 -20.71 9.27 8000.0 97.0 259.5 10.00 ... 0 0 0 0 0 0 0 1 0 0
4 71201 468 152.1 0.00 6.60 4.05 19000.0 100.0 497.2 2.22 ... 0 0 0 0 0 0 0 0 0 1

5 rows × 29 columns

Scaling

TASK: Due to some measurements being in terms of percentages and other metrics being total counts (population), we should scale this data first. Use Sklearn to scale the X feature matrix.

In [9]:
from sklearn.preprocessing import StandardScaler
In [10]:
scaler = StandardScaler() 
scaled_X = scaler.fit_transform(X)
scaled_X
Out[10]:
array([[ 0.0133285 ,  0.01855412, -0.20308668, ..., -0.31544015,
        -0.54772256, -0.36514837],
       [-0.21730118, -0.32370888, -0.14378531, ..., -0.31544015,
        -0.54772256, -0.36514837],
       [ 0.02905136,  0.97784988, -0.22956327, ..., -0.31544015,
        -0.54772256, -0.36514837],
       ...,
       [-0.06726127, -0.04756396, -0.20881553, ..., -0.31544015,
        -0.54772256, -0.36514837],
       [-0.15081724,  0.07669798, -0.22840201, ..., -0.31544015,
         1.82574186, -0.36514837],
       [-0.14464933, -0.12356132, -0.2160153 , ..., -0.31544015,
         1.82574186, -0.36514837]])

Creating and Fitting Kmeans Model

TASK: Use a for loop to create and fit multiple KMeans models, testing from K=2-30 clusters. Keep track of the Sum of Squared Distances for each K value, then plot this out to create an "elbow" plot of K versus SSD. Optional: You may also want to create a bar plot showing the SSD difference from the previous cluster.

In [11]:
from sklearn.cluster import KMeans
In [83]:
ssd = []

# many iterations can take a lot of time
for k in range(2,30):
    model = KMeans(n_clusters=k)
    model.fit(scaled_X)
    ssd.append(model.inertia_)
In [84]:
plt.plot(range(2,30),ssd,'o--')
plt.show()
In [89]:
pd.Series(ssd).diff().plot(kind='bar')
plt.show()
# from both plots we can see that the rate of decrease slows at K=3 (index 2 in the bar plot).
# After that, the next large drop is at K=16.
# There is no right answer; you could split into 3 clusters, or into 16.
# In practice you would pick the K value of interest and continue the analysis from there.

# in the bar plot we are simply looking for small bars (they show small differences, i.e. the plotted diff() values)
# and then index+1 = K value
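The "look for small bars" heuristic can also be automated; a sketch on invented SSD numbers shaped so the elbow lands at K=3 (in the notebook you would use the real `ssd` list):

```python
import pandas as pd

# fake SSD values for K=2..8, with a big drop only when going to K=3 (invented numbers)
ssd = [6000.0, 4000.0, 3800.0, 3650.0, 3520.0, 3400.0, 3300.0]
diffs = pd.Series(ssd, index=range(2, 9)).diff()  # SSD(K) - SSD(K-1), negative values

# the largest (most negative) diff marks the last big drop; its index is a candidate elbow K
elbow_k = diffs.idxmin()
print(elbow_k)  # 3
```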

Model Interpretation

TASK: What K value do you think is a good choice? Are there multiple reasonable choices? What features are helping define these cluster choices? As this is unsupervised learning, there is no 100% correct answer here. Please feel free to jump to the solutions for a full discussion on this!

In [751]:
# Nothing to really code here, but choose a K value and see what features 
# are most correlated to belonging to a particular cluster!

# Remember, there is no 100% correct answer here!

Example Interpretation: Choosing K=3

One could say that there is a significant drop off in SSD difference at K=3 (although we can see it continues to drop off past this). What would an analysis look like for K=3? Let's explore which features are important in the decision of 3 clusters!

In [12]:
model = KMeans(n_clusters=3)
model.fit(scaled_X)
model.labels_
Out[12]:
array([0, 1, 1, 1, 2, 0, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 1, 0,
       2, 0, 1, 2, 0, 1, 2, 1, 2, 0, 1, 0, 1, 0, 2, 1, 2, 0, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 2, 1, 2, 2, 0, 1, 1, 1, 1, 1, 0, 0, 2, 0, 2, 1, 2,
       2, 1, 1, 0, 0, 1, 1, 2, 0, 2, 2, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 2,
       2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 2, 1, 1, 0, 1, 1, 2, 1, 1, 0,
       2, 1, 0, 0, 1, 2, 2, 2, 2, 2, 0, 0, 1, 1, 0, 2, 1, 1, 0, 2, 0, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 2, 1, 1, 2, 1, 0, 0, 1, 2, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 2, 1, 1, 1, 2, 1, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 0,
       1, 0, 2, 2, 2, 1, 0, 0, 2, 1, 0, 1, 0, 2, 2, 1, 2, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       0], dtype=int32)
In [95]:
X['Cluster'] = model.labels_
X.corr()['Cluster'].iloc[:-1].sort_values()
Out[95]:
Region_LATIN AMER. & CARIB                   -0.331761
Crops (%)                                    -0.268952
Birthrate                                    -0.231078
Region_OCEANIA                               -0.198612
Region_NEAR EAST                             -0.191334
Region_C.W. OF IND. STATES                   -0.160386
Region_NORTHERN AFRICA                       -0.144986
Agriculture                                  -0.131072
Infant mortality (per 1000 births)           -0.093521
Industry                                     -0.082611
Population                                   -0.078763
Coastline (coast/area ratio)                 -0.072183
Region_ASIA (EX. NEAR EAST)                  -0.060406
Other (%)                                    -0.040395
Area (sq. mi.)                               -0.020773
Region_NORTHERN AMERICA                       0.085462
Region_SUB-SAHARAN AFRICA                     0.100389
Literacy (%)                                  0.123004
Pop. Density (per sq. mi.)                    0.174920
Region_BALTICS                                0.177698
Arable (%)                                    0.214653
Service                                       0.247332
Region_EASTERN EUROPE                         0.297113
Deathrate                                     0.345268
Net migration                                 0.356981
Climate                                       0.410376
Phones (per 1000)                             0.494858
Region_WESTERN EUROPE                         0.553122
GDP ($ per capita)                            0.589339
Name: Cluster, dtype: float64
In [96]:
X.corr()['Cluster'].iloc[:-1].sort_values().plot(kind='bar')
plt.show()
# we can see that one cluster can be interpreted as rich countries (GDP, Western_Europe),
# and another cluster whose strongest features are (Latin_America, birthrates).
# the best way to understand how this is clustered is to complete the bonus task. See below.


BONUS CHALLENGE:

Geographical Model Interpretation

The best way to interpret this model is through visualizing the clusters of countries on a map! NOTE: THIS IS A BONUS SECTION. YOU MAY WANT TO JUMP TO THE SOLUTIONS LECTURE FOR A FULL GUIDE, SINCE WE WILL COVER TOPICS NOT PREVIOUSLY DISCUSSED AND BE HAVING A NUANCED DISCUSSION ON PERFORMANCE!



IF YOU GET STUCK, PLEASE CHECK OUT THE SOLUTIONS LECTURE. AS THIS IS OPTIONAL AND COVERS MANY TOPICS NOT SHOWN IN ANY PREVIOUS LECTURE



TASK: Create cluster labels for a chosen K value. Based on the solutions, we believe either K=3 or K=15 are reasonable choices. But feel free to choose differently and explore.

In [765]:
 
Out[765]:
KMeans(n_clusters=15)
In [766]:
 
Out[766]:
KMeans(n_clusters=3)
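The two empty cells above presumably contained something like the following (a sketch; the fixed `random_state` and the synthetic stand-in for `scaled_X` are my additions so it runs standalone - in the notebook, use the real scaled matrix from the scaling step):

```python
import numpy as np
from sklearn.cluster import KMeans

# synthetic stand-in for scaled_X (221 countries x 29 scaled features)
rng = np.random.default_rng(42)
scaled_X = rng.normal(size=(221, 29))

# fit one model per candidate K and keep the labels
model_3 = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled_X)
model_15 = KMeans(n_clusters=15, n_init=10, random_state=42).fit(scaled_X)

print(len(set(model_3.labels_)), len(set(model_15.labels_)))  # 3 15
```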

TASK: Let's put you in the real world! Your boss just asked you to plot out these clusters on a country level choropleth map, can you figure out how to do this? We won't step by step guide you at all on this, just show you an example result. You'll need to do the following:

  1. Figure out how to install plotly library: https://plotly.com/python/getting-started/

  2. Figure out how to create a geographical choropleth map using plotly: https://plotly.com/python/choropleth-maps/#using-builtin-country-and-state-geometries

  3. You will need ISO Codes for this. Either use the wikipedia page, or use our provided file for this: "../DATA/country_iso_codes.csv"

  4. Combine the cluster labels, ISO Codes, and Country Names to create a world map plot with plotly given what you learned in Step 1 and Step 2.

Note: This is meant to be a more realistic project, where you have a clear objective of what you need to create and accomplish and the necessary online documentation. It's up to you to piece everything together to figure it out! If you get stuck, no worries! Check out the solution lecture.

In [102]:
!pip3 install plotly==5.9.0
Collecting plotly==5.9.0
  Downloading https://files.pythonhosted.org/packages/2d/6a/2c2e9ed190066646bbf1620c0b67f8e363e0024ee7bf0d3ea1fdcc9af3ae/plotly-5.9.0-py2.py3-none-any.whl (15.2MB)
    100% |████████████████████████████████| 15.2MB 2.3MB/s 
Collecting tenacity>=6.2.0 (from plotly==5.9.0)
  Downloading https://files.pythonhosted.org/packages/f2/a5/f86bc8d67c979020438c8559cc70cfe3a1643fd160d35e09c9cca6a09189/tenacity-8.0.1-py3-none-any.whl
Installing collected packages: tenacity, plotly
Successfully installed plotly-5.9.0 tenacity-8.0.1
You are using pip version 18.0, however version 21.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [103]:
!pip3 install --upgrade pip
Collecting pip
  Downloading https://files.pythonhosted.org/packages/a4/6d/6463d49a933f547439d6b5b98b46af8742cc03ae83543e4d7688c2420f8b/pip-21.3.1-py3-none-any.whl (1.7MB)
    100% |████████████████████████████████| 1.7MB 10.5MB/s ta 0:00:01
Installing collected packages: pip
  Found existing installation: pip 18.0
    Uninstalling pip-18.0:
      Successfully uninstalled pip-18.0
Successfully installed pip-21.3.1
In [13]:
import plotly.express as px
In [14]:
iso_codes = pd.read_csv('inp_files/country_iso_codes.csv')
iso_codes
Out[14]:
Country ISO Code
0 Afghanistan AFG
1 Akrotiri and Dhekelia – See United Kingdom, The Akrotiri and Dhekelia – See United Kingdom, The
2 Åland Islands ALA
3 Albania ALB
4 Algeria DZA
... ... ...
296 Congo, Dem. Rep. COD
297 Congo, Repub. of the COG
298 Tanzania TZA
299 Central African Rep. CAF
300 Cote d'Ivoire CIV

301 rows × 2 columns

In [15]:
iso_map = iso_codes.set_index('Country')['ISO Code'].to_dict()
# print partial view of dict
{k: v for i, (k, v) in enumerate(iso_map.items()) if i < 10}
#another way: list(iso_map.items())[:10]
Out[15]:
{'Afghanistan': 'AFG',
 'Akrotiri and Dhekelia – See United Kingdom, The': 'Akrotiri and Dhekelia – See United Kingdom, The',
 'Albania': 'ALB',
 'Algeria': 'DZA',
 'American Samoa': 'ASM',
 'Andorra': 'AND',
 'Angola': 'AGO',
 'Anguilla': 'AIA',
 'Antarctica\u200a[a]': 'ATA',
 'Åland Islands': 'ALA'}
In [16]:
# create an ISO code column from the Country column
df['iso code'] = df['Country'].map(iso_map) # some will be NaN because they're not in the map
# also add the cluster column
df['cluster'] = model.labels_
df.head()
Out[16]:
Country Region Population Area (sq. mi.) Pop. Density (per sq. mi.) Coastline (coast/area ratio) Net migration Infant mortality (per 1000 births) GDP ($ per capita) Literacy (%) ... Crops (%) Other (%) Climate Birthrate Deathrate Agriculture Industry Service iso code cluster
0 Afghanistan ASIA (EX. NEAR EAST) 31056997 647500 48.0 0.00 23.06 163.07 700.0 36.0 ... 0.22 87.65 1.0 46.60 20.34 0.380 0.240 0.380 AFG 0
1 Albania EASTERN EUROPE 3581655 28748 124.6 1.26 -4.93 21.52 4500.0 86.5 ... 4.42 74.49 3.0 15.11 5.22 0.232 0.188 0.579 ALB 1
2 Algeria NORTHERN AFRICA 32930091 2381740 13.8 0.04 -0.39 31.00 6000.0 70.0 ... 0.25 96.53 1.0 17.14 4.61 0.101 0.600 0.298 DZA 1
3 American Samoa OCEANIA 57794 199 290.4 58.29 -20.71 9.27 8000.0 97.0 ... 15.00 75.00 2.0 22.46 3.27 0.000 0.000 0.000 ASM 1
4 Andorra WESTERN EUROPE 71201 468 152.1 0.00 6.60 4.05 19000.0 100.0 ... 0.00 97.78 3.0 8.71 6.25 0.000 0.000 0.000 AND 2

5 rows × 22 columns

In [17]:
# jupyter-notebook may need to be started with an increased iopub_data_rate_limit,
# otherwise it shows an error that the default limit is too low to display the map:
# jupyter-notebook --NotebookApp.iopub_data_rate_limit=1.0e10
# or upgrade the notebook version to >=5.2.2

fig = px.choropleth(df, locations="iso code", # needs country iso codes
                    color="cluster", # color by cluster column
                    hover_name="Country", # column to add to hover information
                    )
fig.show()
# we can see the 3 clusters are: developed countries, Africa, and the rest.
In [18]:
fig.write_html("countries_plot.html")